Hi! This is RStudio!
2+2
## [1] 4
Many formats for data!
Here (and in all DA classes), we only care about RECTANGULAR DATA (“tidy data”):
3 things:
Big breakdown: numerical (quantitative) or categorical (qualitative) ie: number or not!
All together, 4 types:
discrete numerical data. “countable”. Actually counting real object/things/ideas.
continuous numerical data. “measurable”. Eg: weight, volume, duration of time, etc.
Philosophical note: be careful about WHAT you're measuring. Ex:
Age = integer number of years. discrete.
Age = duration of time living on earth. continuous.
height = measuring tape. continuous. Possible to be 66.1", 66.7", 69.23423423"
Weird example: money. gas? pennies? currency conversions? interest!
Class agreement: money is continuous.
nominal categorical. categorical, no INTRINSIC order.
ordinal categorical. categorical, YES natural order.
MORAL OF THE STORY: Not all data is the same. We use DIFFERENT TOOLS for different types!
Summary statistics for Categorical data: proportion!
Ex: In our class, 9/15 students had black hair.
phat:
9/15
## [1] 0.6
Summary statistics for Quantitative data:
For quant variables, always describe:
Note: it doesn’t make sense to talk about ONE number in a dataset. Instead, we care about ALL of them, at the same time! “Distribution”.
ALWAYS MAKE A PICTURE FIRST!
For numerical data: HISTOGRAM!
Ex: class data:
head(ourData)
## X Year.in.School Major Number.of.Siblings Name.of.Hometown
## 1 Student 01 Sophomore DA 2 Chicago
## 2 Student 02 Sophomore Physics 1 Hilliard, OH
## 3 Student 03 Sophomore DA 3 Hillsborough, CA
## 4 Student 04 Junior Bio 1 Vungtau, Vietnam
## 5 Student 05 Sophomore Econ 1 Lakeville, MN
## 6 Student 06 Sophomore CS 1 Buon Ma thuot, Vietnam
## Population.of.Hometown Hair.Color Age Height..inches.
## 1 2,700,000 Brown 20 74
## 2 37,114 Black 20 70
## 3 11,000 Brown 20 70
## 4 454,000 Black 20 65
## 5 72,000 Brown 19 70
## 6 502,175 Black 21 69
## Favorite.Childhood.Activity
## 1 Sports
## 2 video games
## 3 Sailing
## 4 video games
## 5 Playing sports
## 6 playing soccer, chess, boxing
Q: what can we say about your height?
Hist!
hist(ourData$Height..inches.)
Observations:
- typical height is around 67/68
- spread: looks like “most” values are btwn 62 - 72
- looks symmetrical “ish”
Q: what if we use more than 5 rectangles?
hist(ourData$Height..inches., breaks = 10)
Interesting new feature! Bimodal! Q: could it be because of sex?
MORAL OF THE STORY: Follow the “Tao of the Histogram”!
Ex: our siblings
hist(ourData$Number.of.Siblings, breaks = c(0,.99,1.99,2.99, 3.99))
Tao of the histogram!
Ie, numerical summary of distribution.
Range = max - min. Con: highly susceptible to outliers.
Standard deviation = “average distance btwn points and the mean”. Formula: on the board. Note: must square to measure the TOTAL variation!
IQR. Quartiles: cutoffs for 25% (Q1) and 75% (Q3) in the dataset. IQR = Q3 - Q1
Ex: Data = {15, 16, 16, 18, 20, 27}
Compute IQR = 20 - 16 = 4
(M = 17, Q1 = 16, Q3 = 20)
Meaning: range of the MIDDLE 50% of the data!
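In R, quartiles and the IQR are one call away. A quick sketch, with one caveat: R's default quantile interpolation can give slightly different cutoffs than the by-hand method above.

```r
dat <- c(15, 16, 16, 18, 20, 27)
quantile(dat)   # min, Q1, median, Q3, max (R's default interpolation)
IQR(dat)        # Q3 - Q1; R's interpolation gives 3.5 here, not 4
```

Different textbooks (and different `type` arguments to `quantile`) use different quartile conventions, so small mismatches like this are normal.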
A statistic is ROBUST if it is NOT strongly affected by skew or outliers.
Ex: Data = {1,1,1,1,1,19}
Mean = 4, Med = 1. Seems like median is better!
Distribution: The salaries of all employees at a major corporation.
Which would provide a rosier picture of compensation: the mean or the median?
A: expect strong right skew (tail)! expect mean to be bigger (since not robust).
Q: what “counts” as an outlier? Need a test!
Q: There are two sections of calculus. Both take the same exam, both class means are the same.
Is it possible that a student who scores 95 in one section could be an outlier, but a student with same score in the other is NOT an outlier?
Yes! Must factor in the spread!
Let’s use IQR! (robust!)
“Cutoffs” for outliers:
low = Q1 - 1.5*IQR = Q1 - 1.5(Q3-Q1)
hi = Q3 + 1.5*IQR = Q3 + 1.5(Q3-Q1)
Ex: Data = { 1, 2, 3, 4, 5, 6, 7, 8, 8, 20}
Q: are there outliers?
M = 5.5 Q1 = 3, Q3 = 8, IQR = 8-3 = 5
Cutoffs:
low:
3 - 1.5*5
## [1] -4.5
hi:
8 + 1.5*5
## [1] 15.5
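As an R sketch, using the by-hand quartiles above, we can flag which values fall outside the cutoffs:

```r
dat <- c(1, 2, 3, 4, 5, 6, 7, 8, 8, 20)
lo <- 3 - 1.5 * 5   # Q1 - 1.5*IQR, with Q1 = 3, IQR = 5 from above
hi <- 8 + 1.5 * 5   # Q3 + 1.5*IQR
dat[dat < lo | dat > hi]   # only 20 falls outside (-4.5, 15.5)
```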
Great for comparison!
Ex: iris dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Ex: sepal length
boxplot(iris$Sepal.Length, horizontal = T)
Q: compare species?
boxplot(iris$Sepal.Length ~ iris$Species, horizontal = T)
Plan: stay close to the book!
Key building blocks:
A random process / chance experiment is any scenario where more than one unknown outcome could occur!
Ex: Tossing a coin, rolling dice, sampling random student
Sample space = S = the set of all possible outcomes.
Event = A = any subset of the sample space (what we care about!)
Ex: Rolling a dice. Let A be the event that we get an even number. Let B be the event that we get a prime.
S = {1,2,3,4,5,6} A = {2,4,6} B = {2,3,5}
goal: combine events into useful statements
3 tools:
Find:
A’ = {1,3,5} A U B = {2,3,4,5,6} A intersect B = {2}
2.2 Axioms of probability
The very basics!
For some event A, the probability of event A, denoted P(A), satisfies…
1). P(A) >= 0 (for any event A in our sample space S)
2). P(S) = 1 (since S is all possibilities)
3). If A, B are mutually exclusive, then:
P(A U B) = P(A) + P(B)
Ex) Using only the axioms, prove that P(A’) = 1 - P(A)
Ex) Axioms, prove P(A) <= 1 for any A
Ex). P(A U B) = P(A) + P(B) - P(A and B)
Exercise: prove from axioms! (it’s in the book!)
Ex). Cards! If we draw at random, what’s the prob that our card is a heart or a face?
P(H U F) = P(H) + P(F) - P(H and F)
= 13/52 + 12/52 - 3/52
The PRODUCT RULE: if we make a sequence of k decisions,
- with n1 options for the first,
- n2 options for the second, …
- nk options for the kth decision.
If the decision at any one stage doesn’t affect any other, then there are:
n = n1 * n2 * n3 * ... * nk total possibilities
Ex). We toss a coin 3 times. How many possible outcomes? Ex: H H H
Ex). Roll 2 dice. How many outcomes?
6 * 6 = 36
Ex). In OH, license plates have 3 letters followed by 4 numbers.
How many plates are possible?
26^3*10^4
## [1] 175760000
Ex). Above, but no letter “O”, can’t end on an odd number.
Ex) How many arrangements of the letters in ALICE are there?
Ex) How many arrangements of the letters in BANANA are there?
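For BANANA (6 letters: three A's, two N's, one B), we divide out the repeated letters. A quick R check:

```r
# 6! arrangements of the letters, divided by 3! for the repeated A's
# and 2! for the repeated N's
factorial(6) / (factorial(3) * factorial(2))   # 60 arrangements
```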
Ex) There are 20 children. 3 swings. How many seating arrangements are there?
Ex). There are 20 children. We need to pick a team of 3 children. How many teams?
Problem: teams have no arrangement/order!
What’s the difference? ORDER!!!
Perms - order matters!
Combs - doesn’t
Ex). In a tennis league, there are 16 players. How many possible matchings are there?
Ex2). Same league. After a competition, someone gets 1st place, 2nd place, 3rd place. How many possible awards brackets?
Ex3). We toss 8 coins. What’s the probability of getting:
H H H H T T T T
H T H T H T H T
Ex4). We toss 8 coins. What's the probability of getting exactly 4 heads?
P(Exactly 4 heads) = 70/2^8
70/2^8
## [1] 0.2734375
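The 70 above is choose(8,4), the number of ways to place the 4 heads among the 8 tosses. Both the by-hand count and R's built-in binomial agree:

```r
choose(8, 4) / 2^8                 # 70 / 256, counting arrangements
dbinom(4, size = 8, prob = 0.5)    # same answer via the binomial dist
```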
Ex5) We toss 10 coins. What's the probability of exactly 3 Heads?
Future: binomial dist!
Alternate version:
Ex3). Roll a dice 10 times. What's the probability that the first 3 rolls show “4”, and no others do?
Ex: 4 4 4 1 2 3 2 1 5 6
Ex4). What’s the prob that the LAST 3 rolls show “4”?
Same thing!
Ex5). What’s the prob that EXACTLY 3 of the 10 rolls shows “4”?
Future: binomial distribution
Ex6). Cards.
Draw a 5-card hand (from 52 card deck).
a). How many possible hands.
choose(52,5)
## [1] 2598960
What’s the probability of getting one pair?
Ex: 3 3 J A 7
NO: 3 3 3 A 7
No: 3 3 3 3 K
NO: 3 3 J J K
Decisions:
- decide which kind for the pair? 13 C 1
- decide which two suits for the pair? 4 C 2
- decide kinds for other 3 cards: 12 C 3
- for each of the three: which suit? 4C1 * 4C1 * 4C1
Final: 13C1 * 4C2 * 12C3 * 4C1 * 4C1 * 4C1
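Dividing the count by the total number of 5-card hands turns it into a probability. A sketch in R:

```r
ways  <- choose(13, 1) * choose(4, 2) * choose(12, 3) * choose(4, 1)^3
hands <- choose(52, 5)
ways / hands   # probability of exactly one pair, roughly 0.42
```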
Practice: wikipedia! Google: “wiki poker probabilities”
Idea: if it’s cloudy, more likely to rain.
P(A | B) = "the probability that A happens, GIVEN that B has occurred"
= P(A and B) / P(B)
Ex) At DU, 52% of students are female, 48% male. About 8.2% of all students are female STEM majors.
If we select a student at random, and that student is female, what’s the probability that she’s a STEM major?
given: female!
Need: P(S | F) = P(S and F)/P(F) = .082/.52
Ex) Lie detector test. It advertises itself as 95% accurate. This means that IF a person is lying, then there’s a 95% chance of a + result.
However, we also know that if a person is NOT lying, there’s a 90% chance the test shows -.
Suppose that most people (99%) are honest.
b). What’s the prob a rando person tests “+”?
Only options: L and +, or. L’ and +
.01*.95 + .99*.1
## [1] 0.1085
This is “Total Prob!”. Breaking it down into cases.
P(L | + ) = P( L and + )/P(+)
Tree:
.01*.95/(.01*.95 + .99*.1)
## [1] 0.0875576
If you get +, there’s only an 8.7% chance that you really lied!
Q: Why so small?
Idea: sometimes, one outcome DOES NOT AFFECT another!
Independence!
Ex: coin tosses.
Math definition:
We say events A,B are INDEPENDENT if:
P(A | B ) = P(A)
P(B | A ) = P(B)
if one is true, so is the other.
Ex: poker and snacks
P(P|C)
12/30
## [1] 0.4
P(P)
25/115
## [1] 0.2173913
Q: are these different enough to suggest not independent?
Ex). Text (Example 2.36)
Independence:
P(A | B) = P(A)
P(B | A) = P(B)
EASY WAY TO CHECK:
P(A and B) = P(A)*P(B)
Ex) Cards! Are the events “draw a heart” and “draw a face card” independent?
Justify mathematically.
Check: P(H and F) =? P(H)*P(F)
3/52 =? 13/52 * 12/52
3/52
## [1] 0.05769231
13/52 * 12/52
## [1] 0.05769231
Independent events!
Idea: think more holistically about probability.
RANDOM VARIABLE: map each outcome in sample space to real number.
Ex1. Roll dice. Let X = the number that shows. Possible values for X = {1,2,3,4,5,6}
Ex2. Roll two dice. Let X = their sum. values for X = {2,3,4,…,11,12}
Ex3. Toss a coin repeatedly until get H. Then stop. Let X = the number of coin tosses. {1, 2, 3, …. }
A discrete PROBABILITY DISTRIBUTION for a RV X is:
Notes:
the SUPPORT for the distribution is the set of all X with nonzero prob {1,2,3,4,5,6}
0 <= P(x) <= 1 for all x in support
sum( P(x) ) = 1
Ex1-3) Construct distributions for the RVs listed above. [Solved on the board].
CDF - cumulative distribution function
on board
Notes about CDF:
Why cdf?
A: very useful for RANGES of options for X
Ie:
P(a <= X <= b) = F(b) - F(a-) <- book, p106
Ex: What’s the prob we roll a number btwn 2 and 5 (inclusive)?
P(2 <= X <= 5) = F(5) - F(1) = 5/6 - 1/6 = 4/6
Before: described data that happenED
Now: expectations for the future! prob!
E[X] = sum( x*P(X=x) )
Ex). Dice Distro.
1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
## [1] 3.5
If we rolled the dice MANY times, we’d expect an average close to 3.5.
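We can check that long-run interpretation with a quick simulation (a sketch; the seed is arbitrary):

```r
set.seed(1)                                   # reproducible
rolls <- sample(1:6, 100000, replace = TRUE)  # many dice rolls
mean(rolls)                                   # lands close to E[X] = 3.5
```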
V[X] =
Ex) Find V for dice distro.
V[X]
(1-3.5)^2*1/6 + (2-3.5)^2*1/6 + (3-3.5)^2*1/6 + (4-3.5)^2*1/6 + (5-3.5)^2*1/6 + (6-3.5)^2*1/6
## [1] 2.916667
SD[X]
sqrt(2.91666666)
## [1] 1.707825
E[ h(X) ] = sum( h(x)*P(X=x) )
E[ X^2 ] = sum( x^2 * P(X=x) )
Linear trans: (p115)
Y = aX1 + bX2 + c
where X1, X2 are RVs, and a,b,c are constants
then:
E[Y] = a * E[X1] + b * E[X2] + c
V[Y] = a^2 * V[X1] + b^2 * V[X2]
SD[Y] = sqrt(V[Y]). (WARNING: must find V first!)
Ex). A grad program values the math portion of entrance exam twice as much as the verbal.
Transf:
Y = 2*X1 + 1*X2
where X1 = math score, X2 = verbal
Info:
math: mean = 120, stdev = 5
verb: mean = 160, stdev = 7
Q: What are the mean and stdev of weighted score?
E[Y] = 2*E[X1] + 1*E[X2] = 2*120 + 1*160
2*120 + 1*160
## [1] 400
V[Y] = 2^2*V[X1] + 1^2*V[X2] = 2^2 * 5^2 + 1^2 * 7^2
2^2 * 5^2 + 1^2 * 7^2
## [1] 149
stdev:
sqrt(149)
## [1] 12.20656
Note: must get V first!
Bernoulli.
X ~ bernoulli(p) [NOTE: p is a “parameter” ]
Only two outcomes: X=0 (fail) or X=1 (success).
P(X = 1) = p (prob of success)
P(X = 0) = 1-p
E[X] = p
V[X] = p(1-p) <- check later!
Binomial Dist
X ~ binom(n,p)
ASSUMPTIONS:
Ex). If we roll dice 10 times, what’s the prob that exactly 3 of them show “4”?
X~binom(10,1/6)
Bernoulli
Support: {0,1}
2). Binomial
X~binom(n,p)
X counts the number of successes out of n trials.
Ex). At DU, about 80% of students are from out-of-state.
If we take a sample of 6 students, what’s the prob that exactly 3 of them are from out-of-state?
X ~ binom(6, .8)
P(X=3)
20*.8^3*.2^3
## [1] 0.08192
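The 20 above is choose(6,3), the ways to pick which 3 of the 6 students are out-of-state. R's dbinom packages the whole formula:

```r
choose(6, 3) * 0.8^3 * 0.2^3       # by hand: (ways) * p^3 * (1-p)^3
dbinom(3, size = 6, prob = 0.8)    # same thing in one call
```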
Binomial facts:
Support: {0,1,2,..,n}
E[X] = n*p
V[X] = n*p*(1-p)
X ~ geom(p)
X = the number of trials needed until the first success.
Ex: We toss a coin repeatedly until get H. What's the prob it takes 5 times?
P(T and T and T and T and H) = (1/2)^4 * 1/2
Ex). Suppose we choose students at random until we get one from Ohio. (remember: 80% from out-of-state).
What’s the prob that we must sample at least 3 students until we get the first from Ohio?
Support: {1,2,3,….}
Expected Value: E[X] = 1/p. ONLY FOR GEOMETRIC! Variance: V[X] = (1-p)/p^2
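A caution if you check these in R: R's geometric functions count the number of FAILURES before the first success, not the total number of trials, so everything is shifted by one:

```r
# P(first heads on toss 5) = four tails, then a head
(1/2)^4 * (1/2)              # by hand: 1/32
dgeom(4, prob = 0.5)         # x = 4 failures, NOT 5 trials!

# P(at least 3 students until the first Ohioan), p = 0.2 in-state
# (since 80% are from out-of-state):
1 - pgeom(1, prob = 0.2)     # = (0.8)^2 = 0.64
```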
X ~ pois(mu)
X = the number of occurrences of a random event over an interval of space/time
Assumptions:
Ex). Since 1883, Tokyo has experienced 5 earthquakes of the “most severe” category.
Assuming constant likelihood over time, what’s the probability that at least one occurs in the next year? In the next 10 years?
mu = 5 quakes / 140 years = 5/140
P(X>=1) = P(X=1) + P(X=2) + ….. = 1 - P(X=0)
1-exp(-5/140)*(5/140)^0/1
## [1] 0.03508406
b). next 10 years?
idea: unit changed! want quakes/10year period
mu = 10*5/140 = 50/140
1-exp(-50/140)*(50/140)^0/1
## [1] 0.3003275
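R's ppois gives the same answers without writing out the Poisson formula by hand:

```r
1 - ppois(0, lambda = 5/140)    # P(at least 1 quake next year)
1 - ppois(0, lambda = 50/140)   # next 10 years: rescale mu first!
```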
Ex. Height is cts.
In US, avg height for women is 64”, stdev = 2.4”.
What’s the prob that a rando woman is EXACTLY 64.0000000000….”
A: zero! weird!
True for ALL cts dist:
P(X = a) = 0 for any a!
So, only makes sense to talk about RANGES for cts variables.
What to do? probability DENSITY function!
Idea: need a DENSITY function. f(x). pdf. probability = AREA under the curve!
“straight line dist”
Ex) Suppose the time you wait at a bus stop varies uniformly btwn 5 and 15 min.
X~unif(5,15)
a). sketch the pdf
c). What’s P(X<=7)
d). What’s P(X=7) = 0
In general:
X cts: P(a <= X <= b) = P(a < X < b)
WARNING: not true for X discrete!
Ex). Suppose f(x) = kx^2 0 <= x <= 2. (0 otherwise.) What value of k makes f(x) a valid pdf?
Ex). Compute P(X<1) = F(1)
cdf = all probability to the lEFT
F(x) = P(X <= x)
Ex). Find cdf for above dist.
f(x) = (3/8) x^2 0 <= x <= 2. (0 otherwise.)
Ex). Find P( 1/2 < X < 1 ). = F(1) - F(1/2)
= 1^3/8 - (1/2)^3/8
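Numerical integration in R is a handy check on the calculus here:

```r
f <- function(x) (3/8) * x^2      # the pdf from the example
integrate(f, 0, 2)$value          # total area under the pdf: 1
integrate(f, 0.5, 1)$value        # P(1/2 < X < 1) = 1/8 - 1/64
```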
Ex). Find cdf for X~unif(a,b)
Before: X discrete:
E[X] = sum( x*P(x) )
Now: replace sum with integral!
V[X] = sum( (x-mu)^2 * P(x) )
Now: replace with integrals!
Ex. Compute E[X] for above
f(x) = 3/8 x^2 0 <= x <= 2. (0 otherwise.)
4.13e) (p155) What’s the prob that headway (X) is within 1 standard deviation of the mean value?
Ex). Draw a 5-card hand at random. What’s prob of 1 pair?
Ex: 3 3 7 J Q
NO: 3 3 3 J Q
NO: 3 3 7 7 Q
Decisions:
which kind for pair? 13C1
which 3s? 4C2
what kind for remaining? 12 C 3
for each card: 4C1. (three of them)
Prob:
( 13C1 * 4C2 * 12C3 * (4C1)^3 ) / 52C5
(number of ways for 1 pair) / total number of 5-card hands
Since 1883, tokyo had 5 serious quakes. What’s the prob of at least 8 quakes in the next year?
X ~ pois(mu=5/140)
P(X >= 8) = P(8) + P(9) + P(10) + …
= 1 - P(7) - P(6) - ... - P(1) - P(0)
Ex) Draw a 5-card hand. What’s the prob of getting 1 pair?
Ex: 3 3 7 J K
No: 3 3 3 7 J
No: 3 3 7 7 J
Decisions:
which kind for the pair? 13C1
which couple to use? 4C2
what kind for other 3? 12C3
for each, which suit? 4C1 for each!
P(pair) = ( 13C1 * 4C2 * 12C3 * (4C1)^3 ) / 52C5
a). what are mean and variance of depth?
20^2/25 - 7.5^2/25
## [1] 13.75
20^3/(12.5*3) - 7.5^3/(12.5*3)
## [1] 202.0833
202.08 - 13.75^2
## [1] 13.0175
Ad: shows up in nature allll the time!
Idea: Most of us are in the middle! More extreme is less likely.
Note: Tail area is positive allll the way out to +/- infty.
Fortunately, gets small quickly.
problem: f(x) is NOT INTEGRABLE BY HAND!
Need computational aid:
Problem! What are mu and sigma?
In general: from the problem.
Table: STANDARD NORMAL DIST!!!
mu = 0, sigma = 1
Ex). Suppose Z~normal(0,1)
(ie, std normal!)
a). P(Z < 2.76) = .9971
1-.9971
## [1] 0.0029
d). P( -0.25 < Z < 1.87 )
F(1.87) = .9693 F(-0.25) = .4013
.9693 - .4013
## [1] 0.568
Suppose Z has std normal dist.
a). How large would Z have to be in order to be at the 90th percentile?
z = 1.28
Note: won’t always be exact. Pick the closest.
b). How large would z have to be in order to be in the 40th percentile?
z = -.25
Problem: real world data NOT standard normal!
Solution: z-score!
z-scores CONVERT to std normal dist!
Ex). Height for adult women in US follows normal dist with mean 64”, stdev 2.4”.
a). What’s prob that rando woman is less than 65” tall?
P(X<65) = ?
z = (65 - 64)/2.4
(65 - 64)/2.4
## [1] 0.4166667
look up in table! -> left area = .6628.
About a 66.3% chance.
b). What’s the prob that rando woman is at least 75” tall? z-score:
(75 - 64)/2.4
## [1] 4.583333
uh-oh! off the charts!
area < .002
area ~ 0
First: find z-score from table!
halfway btwn: 1.64 and 1.65. average them! z = 1.645
now: solve for x!
1.645*2.4 + 64
## [1] 67.948
She’d have to be at least 67.948” tall.
normal dist in r.
Ex. If Z has std normal dist, Z~norm(0,1)
in r: “pnorm(zscore)”
pnorm(1.49)
## [1] 0.9318879
z = .25
in r: qnorm(percentile)
qnorm(.6)
## [1] 0.2533471
What’s the prob that rando man is btwn 66” and 70” tall?
find z-score!
x=70:
(70-69)/2.7
## [1] 0.3703704
x=66
(66-69)/2.7
## [1] -1.111111
P(-1.11 < Z < 0.37) =
pnorm(.37)
## [1] 0.6443088
pnorm(-1.11)
## [1] 0.1334995
area:
pnorm(.37)-pnorm(-1.11)
## [1] 0.5108092
About 51.1% of men are btwn 66” and 70”.
z-score:
lo
qnorm(.25)
## [1] -0.6744898
hi
qnorm(.75)
## [1] 0.6744898
Got z, now solve for x!
lo:
-.67*2.7+69
## [1] 67.191
hi:
.67*2.7+69
## [1] 70.809
The middle 50% of men’s heights are btwn 67.2” and 70.8”.
IQR:
70.8-67.2
## [1] 3.6
Idea: for large n, then binomial looks just like normal!
Can be VERY CONVENIENT to replace complicated binomial with simple normal!
Ex) We roll a dice 100 times. What’s the prob that no more than 20 rolls show “4”?
Note: here, n is “large”!
mu
100*1/6
## [1] 16.66667
sigma
sqrt(100*1/6*5/6)
## [1] 3.72678
Find z-scores!
P(X<=20)
(20-16.667)/3.727
## [1] 0.8942849
area from table:
pnorm(.89)
## [1] 0.8132671
There’s about 81.3% chance of no more than 20 “4”s in 100 rolls.
b). What’s prob that we get EXACTLY 17 “4”s in 100 rolls?
X~norm(16.667, 3.727)
P(X=17) = 0
Fix: “continuity correction”
Fudge: P(16.5 < X < 17.5)
z-scores!
17.5:
(17.5-16.667)/3.727
## [1] 0.2235042
16.5:
(16.5-16.667)/3.727
## [1] -0.04480816
area:
pnorm(0.22)-pnorm(-0.04)
## [1] 0.1030179
There’s about a 10.3% chance of getting 17 “4”s in 100 rolls.
d). Use continuity correction to compute P(X<=20)
z-scores:
(20.5-16.667)/3.727
## [1] 1.028441
(-.5 - 16.667)/3.727
## [1] -4.606118
area:
pnorm(1.03) - pnorm(-4.61)
## [1] 0.848493
roll dice 100 times. count “4”s.
X~binom(100,1/6)
X~norm(16.667, 3.727)
P(X<=20)
exact:
pbinom(20,100,1/6)
## [1] 0.8481122
normal approx, no cont correction:
pnorm(20,16.667,3.727)
## [1] 0.8144153
with cont correction:
pnorm(20.5,16.667,3.727)-pnorm(-.5,16.667,3.727)
## [1] 0.8481268
Recall: X~pois(mu = avg rate of occurrences)
X = how many occurrences in unit of time/space
support: {0,1,2,…}
Now: X~exponential(lambda = avg rate of occurrences)
X = amount of time until the next occurrence
Ex). An urgent care center receives an average of 2.3 patients/hour.
a). What’s the prob that at least 2 patients arrive in the next hour?
X~pois(mu=2.3)
P(X>=2) = 1 - P(X<2) = 1 - P(0) - P(1)
1-exp(-2.3) - exp(-2.3)*2.3
## [1] 0.6691458
b). What’s the prob we wait less than 1.5 hours for the next patient?
X ~ exp(lambda = 2.3)
-exp(-2.3*1.5)+1
## [1] 0.9682544
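Same computation with R's built-in exponential cdf:

```r
pexp(1.5, rate = 2.3)    # P(wait < 1.5 hours for the next patient)
```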
Idea: what dist does our data have?
Care about: is our data NORMAL?
Look at histo!
Ex). iris. Q: is sepal length normal?
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
hist:
hist(iris$Sepal.Length)
Opinion: normal-ish. Unimodal, roughly symmetric.
x-y plot:
x = normal percentiles (qnorm), y = original data
Ex: {5,6,7,8,9}
P10:
qnorm(.1)
## [1] -1.281552
P30:
qnorm(.3)
## [1] -0.5244005
And so on for P50, P70, P90.
Ex: {5,6,7,8,9}
qqnorm(c(5,6,7,8,9))
Ex). iris sepal length
hist(iris$Sepal.Length)
qqnorm(iris$Sepal.Length)
Q: is it linear?
A: sure, why not.
Ex). iris petal length
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
hist:
hist(iris$Petal.Length)
qqnorm(iris$Petal.Length)
Oof! Super not linear, data does -not- follow a normal dist.
So far: we’ve talked about INDIVIDUALS.
Ex). What’s prob that rando woman is at least 65” tall?
Now: want to talk about entire SAMPLE STATISTICS (think: xbar, phat).
Ex. In a sample of 10 women, what’s the prob their AVERAGE HEIGHT is at least 65” tall?
A SAMPLING DISTRIBUTION is a prob dist that describes the likelihood of entire SAMPLE RESULTS (xbar, phat).
Ex) Roll 2 dice, take average = xbar. n = 2. Construct the samp dist for xbar.
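A simulation sketch of this sampling distribution (seed arbitrary):

```r
set.seed(1)
xbars <- replicate(100000, mean(sample(1:6, 2, replace = TRUE)))
table(xbars) / length(xbars)   # approximate samp dist for xbar
mean(xbars)                    # centers near the population mean 3.5
```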
CLT: As long as n is “large enough”, then the samp dist for xbar follows an approximately normal dist, NO MATTER WHAT POPULATION WE’RE SAMPLING FROM!!!!!
mean: “the mean of the mean is the mean” “our expectation for xbar is the same as the population mean, mu”
variance: cut down by factor of n
Ex) What’s prob that rando woman is at least 65” tall?
Recall: mu = 64, sigma = 2.4.
z-score:
(65-64)/2.4
## [1] 0.4166667
right area:
1-pnorm(.42)
## [1] 0.3372427
Ex) Sample n=10 women. What's the prob their AVERAGE height is at least 65”?
xbar ~ normal
mu_xbar = 64 sigma_xbar = 2.4/sqrt(10)
z-score!
(65-64)/(2.4/sqrt(10))
## [1] 1.317616
right area:
1-pnorm(1.32)
## [1] 0.09341751
About 9.3% chance that 10 women have mean height at least 65”.
2 cases:
If we’re sampling from a normal population (think: height), then xbar is normal no matter what, even for small n.
If we’re sampling from a NON-NORMAL (or UNKNOWN) population, then NEED LARGE n in order to use normal dist (CLT).
“large enough”: n >= 30
Last: samp dist for xbar
Now: samp dist for phat (sample proportion)
If sample size (n) is large enough, then phat follows an (approximately) normal dist, NO MATTER WHAT POPULATION WE’RE SAMPLING FROM!!!
Large enough:
z = (obs - exp)/stdev
mean: expectation for sample proportion is the pop proportion!
Ex). Voters in a particular county are 55% R. 45% D.
If we take a sample of 100 people, what’s the prob that half or fewer are R?
P( phat <= .5 )
Is n large enough?
n*p:
100*.55
## [1] 55
n*(1-p)
100*(1-.55)
## [1] 45
normal dist! get z-score!
z = (obs - exp)/stdev
obs = .5
exp = .55
stdev = sqrt(.55*(1-.55)/100)
(.5 - .55)/sqrt(.55*(1-.55)/100)
## [1] -1.005038
find area on table:
pnorm(-1.01)
## [1] 0.1562476
About 15.6% chance that half or fewer in our sample are R.
Q: what the heck ARE all these parameters we’re talking about?
mu? sigma? p? lambda?
A: guess (estimate!) based on data!
Think:
Best guess for mu is: xbar!
Best guess for p: phat
Best guess for sigma: s (samp stdev)
Best guess for lambda (exponential dist): ???????
A POINT ESTIMATE is a single number representing a reasonable guess about a parameter.
Big question: how to find it? The “best guess” is not always obvious!
3 strategies:
Ex). Show that xbar is an unbiased estimator for mu.
Ex). Show that phat is unbiased estimator for p.
Practice: Show that s^2 is unbiased estimator for sigma^2 (solution: p253)
Example 6.12
2 kinds:
MoM: match up the moments! solve for parameters!
Note: how many moments do we need? answer: as many as the number of parameters.
Ex). Suppose x1, x2, …, xn are sampled from EXPONENTIAL distribution. Use MoM to estimate lambda.
Ex). x1, … xn are sampled from NORMAL dist. Use MoM to estimate mu and sigma.
Likelihood function:
probability of observing OUR sample result (x1, x2, ..,xn) GIVEN the value of the params
MLE: find choice of params that MAXIMIZES (calculus!) the likelihood function
Ex). Use MLE to find estimate for lambda in exponential dist.
1). get joint pdf
3). take deriv, set =0. (ie find max!) [NOTE: take deriv with respect to LAMBDA!!!]
Ex). 6.22 (p273)
Find estimate for theta using a) MoM and b) MLE
Ex) Use MLE to estimate mu and sigma for normal dist.
Last time, for both unbiased AND M.O.M:
mu = xbar sigma = s
WARNING: must take deriv with respect to BOTH mu and sigma (not at the same time).
Idea: two equations, two unknowns!
Conclusion: MLE estimator for mu is the same (xbar), BUT the estimator for sigma is DIFFERENT! The MLE is BIASED!!!
Idea: what’s a RANGE of reasonable values for the parameter (mu, sigma, p, lambda, etc etc), how sure are we the true value is in there?
Ex). Height for adult men is normal with mean 69”, stdev 2.7”.
mu = 69, sigma = 2.7
If we take a sample of n=36 men, what are the cutoffs for the middle 95% of such sample means?
Q: is xbar normal?
A: two cases:
Middle 95% -> tail area = 2.5% -> z = +/- 1.96
qnorm(.025)
## [1] -1.959964
lo:
69-1.96*2.7/sqrt(36)
## [1] 68.118
hi:
69+1.96*2.7/sqrt(36)
## [1] 69.882
Idea: capture the middle 90%/95%/99% of the sampling distribution.
What are reasonable values of mu based on xbar?
Problem: what about sigma?
Temp solution: for large n (n>=30), s is a good approximation for sigma.
Ex) A biologist wonders: what’s the mean height of reticulated giraffes? Collect data on 40 giraffes, their mean height is 18.2’, stdev 1.7’.
Construct a 95% C.I for the true mean height of giraffes.
Here: critical value = z_alpha = 1.96
plug and chug!
low:
18.2 - 1.96*1.7/sqrt(40)
## [1] 17.67316
hi:
18.2 + 1.96*1.7/sqrt(40)
## [1] 18.72684
We are 95% confident that the true population mean of all reticulated giraffes is btwn 17.67 and 18.72 ft tall.
Is this the same as: ??
There’s a 95% chance that mu is btwn 17.67 and 18.72.
NO! To be continued…
ourData <- sample(1:6,30, replace=TRUE)
mean(ourData) - .67*sd(ourData)/sqrt(30)
## [1] 3.546203
mean(ourData) + .67*sd(ourData)/sqrt(30)
## [1] 3.920463
qnorm(.25)
## [1] -0.6744898
Idea: want best guess for pop parameter based on data!
ESTIMATE!!!!
Ex). Prof Miller wonders: what’s the true mean volume of coffee she pours in the morning?
Collects data on 50 days. Mean volume = 8.2, stdev 1.7oz.
Estimate the true mean volume with 99% confidence.
tail area = .005 -> 2.58
qnorm(.005)
## [1] -2.575829
plug and chug!
8.2 - 2.58*1.7/sqrt(50)
## [1] 7.579726
8.2 + 2.58*1.7/sqrt(50)
## [1] 8.820274
We are 99% confident that the true mean volume of morning coffee is between 7.58oz and 8.82oz.
Ex). Dice rolls! Goal: use sample data to approximate mean value of dice throw.
Note: we know that mu = 3.5
dice sample: (n=10)
sample(1:6, 10, replace = T)
## [1] 4 4 4 3 2 5 2 5 3 5
Take sample of size n=50, construct 50% CI:
ourData <- sample(1:6, 50, replace = T)
ourData
## [1] 2 5 6 6 2 4 6 5 1 4 2 6 2 3 3 3 4 6 5 4 5 4 1 2 4 5 4 6 5 5 2 2 1 3 5 2 5 2
## [39] 6 4 4 1 2 5 6 4 1 6 3 1
mean(ourData) - qnorm(.25)*sd(ourData)/sqrt(50)
## [1] 3.861522
mean(ourData) + qnorm(.25)*sd(ourData)/sqrt(50)
## [1] 3.538478
MORAL OF THE STORY: the confidence level is about the METHOD used for constructing the CI, it is NOT about the individual bounds you calculated.
You have NO IDEA if your interval was successful or not.
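A simulation makes the point concrete (a sketch under assumed conditions: normal population, sigma known, 95% z-intervals; the numbers are arbitrary):

```r
set.seed(1)
mu <- 3.5; sigma <- 1.7; n <- 30
hits <- replicate(10000, {
  x  <- rnorm(n, mu, sigma)              # one sample
  lo <- mean(x) - 1.96 * sigma / sqrt(n)
  hi <- mean(x) + 1.96 * sigma / sqrt(n)
  lo <= mu & mu <= hi                    # did THIS interval capture mu?
})
mean(hits)    # about 0.95: the METHOD works ~95% of the time
```

Any single interval either captured mu or it didn't; only the long-run success rate of the method is 95%.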
Math: everything after the +/- in the formula.
Goal: make MOE small! Two factors that affect MOE:
sample size (n). As sample size increases, the MOE for CI decreases
Think: yaaaay!
confidence level. As confidence increases, the MOE for the CI gets BIGGER!!!
Think: bummer!
Recall, the MOE in the coffee example was:
2.58*1.7/sqrt(50)
## [1] 0.6202741
Suppose we wish to conduct a follow-up study that estimates the true volume within 0.1 oz, at 99.9% confidence.
How large must our sample be?
Idea: plug in to MOE formula, solve back for n.
Info:
MOE < .1
z = 3.29
qnorm(.0005)
## [1] -3.290527
prior estimate for sigma = 1.7oz
plug in!
(3.29*1.7/.1)^2
## [1] 3128.165
Need n at least 3129.
Note: might need to compromise! Maybe either less confidence, OR accept a bigger margin of error.
So far: CI for mu (large n)
Q: wait. if we don’t know mu, how the heck do we know sigma?
A: swap out for s (ok if n large, since s is unbiased).
Q: what if n is small?
before:
z = (xbar - mu)/(sigma/sqrt(n))
now:
t = (xbar - mu)/(s/sqrt(n))
The t dist comes from using s instead of sigma.
Derivation: grad school!
Neat t facts:
Q: are t-intervals wider or narrower than z-intervals?
A: t always wider! more stuff in the tails!
Think: bummer! bigger MOE with t dist!
It’s the price we pay for not knowing sigma.
WARNING: t table is BACKWARDS from z table!
In the t-table:
Note: need DEGREES OF FREEDOM
df = n-1
Ex). Suppose we wish to estimate mu with 95% confidence from sample with n=15.
Note: it’s ESSENTIAL to use t here, not z!
t = 2.145
qt(.025,14)
## [1] -2.144787
Ex) Estimate mu with 99% confidence, n=20.
What’s t?
t = 2.861
Ex) A researcher wonders, what’s the mean height for women in Sweden?
Collect data from 12 women in Sweden, find mean height 65.2”, stdev 2.5”.
Estimate the true mean height with 90% confidence.
t = 1.796
65.2 - 1.796*2.5/sqrt(12)
## [1] 63.90385
65.2 + 1.796*2.5/sqrt(12)
## [1] 66.49615
We are 90% confident that true mean height for Swedish women is btwn 63.9” and 66.5”.
Ex). What proportion of DU students will travel out-of-state for break?
random sample of 100 students: 53 of them say “yes”, travel.
Estimate the true proportion of travelling students with 94% confidence.
qnorm(.03)
## [1] -1.880794
z = 1.88
plug and chug!
.53 - 1.88*sqrt(.53*(1-.53)/100)
## [1] 0.4361694
.53 + 1.88*sqrt(.53*(1-.53)/100)
## [1] 0.6238306
We’re 94% confident that the true prop of travelling students is btwn 44% and 62%.
Q: what if we wanted smaller MOE?
MOE:
1.88*sqrt(.53*(1-.53)/100)
## [1] 0.09383065
Ex). Suppose we wish to conduct a future study that estimates the prop to within 1%, at 95% confidence.
Idea: plug in to MOE formula, solve back for n
prior estimate for p = .53
z = 1.96
MOE = .01
.53*(1-.53)*(1.96/.01)^2
## [1] 9569.426
Q: what if no prior estimate for p?
A: use = .5. Worst-case scenario.
Recall CLT: As long as n is large, then our sample statistics (like xbar and phat) follow a normal dist, NO MATTER WHAT POPULATION WE’RE SAMPLING FROM!!!
Ex). A researcher suspects that women in Sweden have higher mean height than women in the US.
Data: a random sample of 50 Swedish women have mean height of 65.1”, stdev 2.8”.
Does this provide evidence that mean height for Swedish women is above 64” (the mean for the US)?
H0: mu = 64 (no difference) Ha: mu > 64 (claim/suspicion being investigated)
z-score! (er, actually, the t score here, since we don’t know sigma)
t = (obs - exp)/stdev = (65.1 - 64)/(2.8/sqrt(50))
(65.1 - 64)/(2.8/sqrt(50))
## [1] 2.777919
p-val = probability of observing a sample result as (or more) extreme than ours, ASSUMING THE H0 IS TRUE.
Use “pt” command:
right tail:
1-pt(2.78, 49)
## [1] 0.003844915
On exam: use t table.
Interpret: If it’s true that mean height for Swedish women is 64”, then the probability of observing a sample mean as big as ours (65.1”) is only 0.38%.
Note: looks like H0 and sample data DISAGREE!!
Since our p-val is small (p-val < .05), we REJECT H0. We found strong evidence that mean height for Swedish women is greater than 64”.
Two big methods for statistical inference:
1). Hypotheses
We assume that the defendant is innocent (H0), BUT we think they’re guilty (Ha)!
H0: default assumption (no difference/ NOT GUILTY!!!) note: H0 always has “=”. (mu =…., p=….., sigma =…..)
Ha: our claim/suspicion/question (GUILTY!!) note: always inequality: >, <, !=
2).
In a criminal trial: witnesses give testimony. In a hyp test: data!
jury deliberates: is evidence convincing? hyp test: is our p-val small enough?
Note:
- small p-vals are evidence AGAINST H0
- big p-vals aren't evidence of anything
Are we convinced “beyond a shadow of a doubt”???
Hyp test: is p-val < significance level ??? (alpha=.05) If p<alpha -> reject H0. If p>=alpha -> fail to reject H0.
WARNING: each conclusion must have TWO SENTENCES: the decision (reject / fail to reject H0), and what that means about the evidence, in context.
Ex). A researcher suspects that instrumentalists (musicians) have a higher mean IQ than the general population (100).
Data: in a sample of 100 musicians, mean IQ of 103, stdev 15.2.
Does this provide strong evidence to support their suspicion?
Perform a complete hyp test.
1). Hypotheses
H0: mu = 100 Ha: mu > 100
2). Test stat: (t/z score!)
(103 - 100)/(15.2/sqrt(100))
## [1] 1.973684
3). p-val
note: df = 100-1 = 99
Not on the table! Use the next smallest df listed: df = 60 (a conservative choice).
tail area: btwn .025 and .05
.025 < p-val < .05
p-val = prob of getting a sample like ours, if H0 is true
If it’s really true that musicians have mean iq of 100, then there’s only a 2.5%-5% chance of observing a sample result as extreme as ours.
4). Since p-val < .05, we REJECT H0! We have strong evidence that mean iq for musicians is greater than 100.
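On a computer we don’t need the table’s bracket: `pt` gives the exact tail area with the full df = 99 (a quick sketch; the table’s .025 < p-val < .05 brackets it):

```r
# Exact p-value with the true df, instead of the table's df = 60 row.
t_stat <- (103 - 100) / (15.2 / sqrt(100))
pt(t_stat, df = 99, lower.tail = FALSE)  # ~ .026, inside (.025, .05)
```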
Ex) A student suspects that fewer than half of DU students vote R.
To investigate, random sample of 60 DU students. Of these, 27 vote R.
Does the data support the claim? Hyp test!
H0: p = .5 Ha: p < .5
Note: always Z for proportions
z = (obs - exp)/stderr
(27/60 - .5)/sqrt(.5*(1-.5)/60)
## [1] -0.7745967
pnorm(-.77)
## [1] 0.2206499
p-val = .221
If it’s true that 50% of DU students vote R, then there’s a 22.1% chance of getting a sample result like ours.
4). Since p-val > .05, we fail to reject H0. We did not find strong evidence that fewer than half of DU students vote R.
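Base R’s `prop.test` runs the same test; with `correct = FALSE` (no continuity correction) its X-squared statistic is exactly our z squared, and the one-sided p-value matches the `pnorm` computation above:

```r
# One-proportion test: 27 of 60 vote R. H0: p = .5, Ha: p < .5.
# correct = FALSE so the result matches the hand z computation.
prop.test(27, 60, p = 0.5, alternative = "less", correct = FALSE)
# X-squared = 0.6 = (-0.7746)^2, one-sided p-value ~ .22
```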
Note: the conclusion from a test isn’t always correct! (This is where Type I and Type II errors come in.)
Ex). An EPA enforcement officer suspects that a local well water supply contains too much of Toxin A (the federally mandated level is 0.8 g/ml).
To investigate, she collects water samples (n=50), and computes average concentration of Toxin A.
Hyp test!
Explain, practically, what it would mean for her test to result in a Type I error? Type II error?
Each error description has two parts: what we conclude, and what’s actually true:
Type I: We conclude/think that the water is unsafe (too much toxin). BUT, actually the water conforms to safety standards (it’s fine).
Type II: We think the water’s fine, but actually it’s dangerous!
Arguments:
Type II is worse, since people drink dirty water!
Type I is worse, since the employee could be fired! Or: Type I is worse, since it wastes precious taxpayer dollars!
Ex). Criminal trial!
Practically, what do type I and type II really mean?
H0: innocent Ha: guilty
Arguments:
Type I is worse since innocent people shouldn’t go to jail!
Type II is worse, since a dangerous person could be set free!
Moral of the story: which is worse? IT DEPENDS!
Upshot: you (the investigator) can tip the scales by your choice of alpha!
Big idea: investigate a question/theses, using a dataset!
My recommendation: pick an interesting (to you!) public dataset.
The format you NEED!
“.csv”
Rectangular data!
Example: customer data:
Load data: Note: you need to tell R where the csv file is. I put mine in my downloads folder. If you are confused about this, get help in office hours!
myData <- read.csv("~/Downloads/Customers.csv")
head(myData)
## CustomerID Gender Age Annual.Income.... Spending.Score..1.100. Profession
## 1 1 Male 19 15000 39 Healthcare
## 2 2 Male 21 35000 81 Engineer
## 3 3 Female 20 86000 6 Engineer
## 4 4 Female 23 59000 77 Lawyer
## 5 5 Female 31 38000 40 Entertainment
## 6 6 Female 22 58000 76 Artist
## Work.Experience Family.Size
## 1 1 4
## 2 3 3
## 3 1 1
## 4 0 2
## 5 2 6
## 6 0 2
hist(myData$Spending.Score..1.100.)
Two tail tests:
Ha: !=
Ex) A researcher wonders if mean IQ for musicians DIFFERS from the general population (100).
Data: 50 musicians, their mean iq is 102.7, stdev 11.2.
Does this support claim/suspicion?
H0: mu = 100 Ha: mu != 100
t = (obs - exp)/stdev
(102.7-100)/(11.2/sqrt(50))
## [1] 1.704632
right tail area:
1-pt(1.70, 49)
## [1] 0.04773554
p-val = 2*.048 = .096
4). conclusion
Since p-val > .05, we fail to reject H0. There’s insufficient evidence that mean iq for musicians differs from 100.
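The doubling step is easy to forget, so a tiny helper (our own, not built-in) makes it explicit:

```r
# Two-tailed p-value: double the area beyond |t|.
two_tail_p <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)

t_stat <- (102.7 - 100) / (11.2 / sqrt(50))
two_tail_p(t_stat, df = 49)  # ~ .095, matching the hand computation
```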
Q: when to use one-tail vs two tail?
A: let the claim decide, BEFORE looking at the data! If the suspicion has a direction (higher/lower), one tail. If it just asks “differs?”, two tails.
Ex) A researcher suspects that a higher proportion of DU students are from out-of-state than for Oberlin.
Data: of 60 DU students sampled, 51 are from out-of-state; of 80 Oberlin students, 58 are.
1). Hypotheses
H0: p1 - p2 = 0 Ha: p1 - p2 > 0
(p1 = DU prop, p2 = Ob prop)
z = (obs - exp)/stderr
(51/60 - 58/80 - 0)/sqrt( 109/140*(1-109/140)/60 + 109/140*(1-109/140)/80)
## [1] 1.76279
1-pnorm(1.76)
## [1] 0.0392039
Interpret: If there’s no difference in proportion btwn DU and Ob, then there’s a 3.9% chance of observing a sample difference as extreme as ours.
4). Conclusion
Since p-val < .05, we reject H0. There’s strong evidence that the true proportion of out-of-state students at DU is higher than at Oberlin.
Ex) Prof Miller wonders, is there a difference in mean delivery time btwn Pizza Hut and Dominos?
Collect data about deliveries:
Of 23 PH deliveries, mean time was 23.4 min, stdev 6.3 min. Of 28 Dom deliveries, mean time was 27.9 min, stdev 9.4 min.
H0: mu1 - mu2 = 0 (no difference) Ha: mu1 - mu2 != 0 (yes difference)
t = (obs-exp)/stderr
(23.4 - 27.9 - 0)/sqrt( 6.3^2/23 + 9.4^2/28 )
## [1] -2.036769
Q: what’s df?
A: use the smaller of the two here: df = 23 - 1 = 22
left tail area:
pt(-2.04, 22)
## [1] 0.02676758
p-val = 2* .026 = .052. (two tail test!)
4). Conclusion
Since p-val > .05, we fail to reject H0. There’s not strong evidence of a difference in mean delivery time.
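From summary statistics, the whole two-sample test can be scripted (function name ours; df uses the conservative rule from class, the smaller n minus 1):

```r
# Two-sample t test from summary stats, two-tailed.
two_sample_t <- function(xbar1, s1, n1, xbar2, s2, n2) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)   # unpooled standard error
  t  <- (xbar1 - xbar2) / se
  df <- min(n1, n2) - 1               # conservative df rule
  c(t = t, df = df, p_value = 2 * pt(abs(t), df, lower.tail = FALSE))
}

two_sample_t(23.4, 6.3, 23, 27.9, 9.4, 28)  # t ~ -2.04, p ~ .054
```

(The .052 above came from rounding t to -2.04 before the table lookup; the unrounded p-value is about .054. Same conclusion either way.)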
Note about p1-p2: don’t use pooled estimate for CI. WE’RE NOT ASSUMING p1 = p2 !!!!
H0: p1 - p2 = 0. <- default assumption in Hyp test
In confidence interval, no assumptions. Trying to estimate the difference!
Ex) A researcher wonders: is there a higher proportion of registered Republicans in Licking County than in Franklin County?
Collect data: 32 of 50 sampled Licking voters are registered R; 51 of 100 sampled Franklin voters are.
Estimate the difference in proportion of R voters btwn the two counties, with 95% confidence. Crit val: z = 1.96
qnorm(.025)
## [1] -1.959964
low:
32/50 - 51/100 - 1.96*sqrt( 32/50*(1-32/50)/50 + 51/100*(1-51/100)/100 )
## [1] -0.03523393
hi:
32/50 - 51/100 + 1.96*sqrt( 32/50*(1-32/50)/50 + 51/100*(1-51/100)/100 )
## [1] 0.2952339
Interpret: we are 95% confident that the difference in prop of R voters is between -3.5% and 29.5%.
Since 0 is contained in the interval, there’s no strong evidence of a difference in proportion.
If the value in H0 is contained in the CI, then FAIL TO REJECT H0!
Otherwise, reject.
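A sketch of the CI computation as a reusable function (our own helper), using the unpooled standard error since no H0 is assumed:

```r
# CI for p1 - p2 (unpooled SE -- we are NOT assuming p1 = p2).
two_prop_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1; p2 <- x2 / n2
  z  <- qnorm(1 - (1 - conf) / 2)     # e.g. 1.96 for 95%
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  (p1 - p2) + c(-1, 1) * z * se
}

two_prop_ci(32, 50, 51, 100)  # ~ (-.035, .295), as above
```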
Ex) A researcher wonders: is there a difference in mean ACT score btwn Licking county and Franklin county?
Data: 50 Licking students have mean ACT 21.3 (stdev 5.4); 80 Franklin students have mean 24.4 (stdev 6.1).
Estimate the difference in mean ACT score with 95% confidence. critical value: t = 2.01
qt(.025, 49)
## [1] -2.009575
low:
21.3-24.4 - 2.01*sqrt(5.4^2/50 + 6.1^2/80)
## [1] -5.157994
hi:
21.3-24.4 + 2.01*sqrt(5.4^2/50 + 6.1^2/80)
## [1] -1.042006
We’re 95% confident that the true difference in mean act score is btwn -5.2 and -1.0.
Looks like mu1-mu2=0 isn’t likely. We reject H0: there’s strong evidence of a difference in mean ACT score btwn Licking and Franklin.
Note: Hyp test and CI will always have same conclusion AS LONG AS ALPHA AND CONFIDENCE MATCH!!!
Ex: if alpha = .05, then use 95% confidence.
Ex: if alpha = .01, then use 99% confidence.
Idea: compare TWO quant variables. (x,y)
main visualization: “scatterplot”. x-y plot of all data points.
Ex). mpg
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
Make a scatterplot to visualize cty (x) and hwy (y)
plot(mpg$cty, mpg$hwy)
cor(mpg$cty, mpg$hwy)
## [1] 0.9559159
Woah! Strong relationship! Makes sense - if more efficient in city, you’d expect more efficient on hwy.
Ex) Compare engine size (displ - x) with cty mpg (cty - y )
plot(mpg$displ, mpg$cty)
cor(mpg$displ, mpg$cty)
## [1] -0.798524
Woah! Strong negative relationship! As engine size increases, mpg decreases. Makes sense.
Ex). iris data
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Compare iris sepal length (x) and sepal width (y).
plot(iris$Sepal.Length, iris$Sepal.Width)
cor(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.1175698
Huh. No obvious relationship btwn sepal length and width. Weird but true.
Note: capital “R” is the software; lowercase “r” is the correlation coefficient.
Ex). mpg.
plot(mpg$cty, mpg$hwy)
cor(mpg$hwy, mpg$cty)
## [1] 0.9559159
r measures the STRENGTH and DIRECTION of a LINEAR relationship.
-1 <= r <= +1 for ANY variables x, y.
Many names for the best-fit line: “line of best fit”, “least squares line”, “regression line”.
Good news: all the same! One UNIQUE “best” line!
Q: What makes it the “best”?
A: it minimizes the sum of the SQUARE RESIDUALS!
Note: the sum/average of residuals is ALWAYS ZERO!!!
In linear model: we make predictions ABOUT y-variable, BASED UPON the x-variable.
x: input, independent variable y: output, dependent variable
Be careful, it matters which is which!
Equation:
yhat = b0 + b1*x
where:
b0 = y-intercept b1 = slope
To compute:
1). find slope
b1 = r*sy/sx
Note: if correlation is STRONGER, then slope is STEEPER! Ie, if strong relationship, then x has bigger effect on y!
2). find intercept
Neat fact: the line ALWAYS touches the point (xbar,ybar)
->
b0 = ybar - b1*xbar
Ex). Data: height and weight of 50 men.
Summary:
Height:
avg height = 70.1” stdev. = 3.6”
Weight:
avg weight = 165.2 lbs stdev. = 11.9 lbs
cor = r = 0.71
Find equation for linear model to predict weight based on height.
Here: x=height, y = weight
.71*11.9/3.6
## [1] 2.346944
165.2 - 2.35*70.1
## [1] 0.465
yhat = 0.465 + 2.35*x
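The hand formulas agree exactly with R’s `lm()`; here’s a check on a small made-up sample (the six x, y values are purely illustrative, not the class data):

```r
# Verify b1 = r*sy/sx and b0 = ybar - b1*xbar against lm().
x <- c(62, 65, 68, 70, 72, 75)        # toy heights (illustrative)
y <- c(120, 140, 152, 160, 175, 190)  # toy weights

b1 <- cor(x, y) * sd(y) / sd(x)       # slope formula
b0 <- mean(y) - b1 * mean(x)          # line passes through (xbar, ybar)

coef(lm(y ~ x))                       # same intercept and slope
```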
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
This is a good dataset: mixture/variety of BOTH categorical AND quantitative variables!
Ex). mpg. Find equation for linear model for predicting cty (y) mpg based on engine size (displ -x ).
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
Summary stats
xbar = 3.47 sx. = 1.29 ybar = 16.86 sy. = 4.26 r. = -0.80
-.8*4.26/1.29
## [1] -2.64186
16.86 + 2.64*3.47
## [1] 26.0208
yhat = 26.02 - 2.64*x
plot(mpg$displ, mpg$cty)
### Making predictions with the least squares line
Plug in x!
Ex). A particular car has a 2.8L engine. Predict its cty mpg.
plug in x=2.8:
26.02 - 2.64*2.8
## [1] 18.628
plot(mpg$displ, mpg$cty)
Ex). Predict the cty mpg for a car with a 20.5L engine.
26.02 - 2.64*20.5
## [1] -28.1
Can’t! Shouldn’t! EXTRAPOLATION: making predictions outside the observed range of data. BAD!
Junk in - junk out!
Coeff: b0, b1
Slope: y=mx+b. If x increases by 1 unit, y inc/dec by m.
Lin reg: If [x-variable] increases by one [x-unit], then we predict/expect [y-variable] to inc/dec by [slope] [y-units].
For each additional L bigger a car’s engine is, we predict its city fuel efficiency to decrease by 2.64 mpg.
Intercept: if x=0, y = y-int
Lin reg: If [x-variable] is zero [x-units], then we predict/expect [y-variable] to be [y-int] [y units].
Car: If a car’s engine is 0L big (!!), we’d predict its city mpg to be 26.02 mpg.
EXTRAPOLATION! Not helpful.
r^2 = percentage of variation in the y-data that’s due to/because of the linear relationship btwn x and y.
Ex). compute and interpret r^2 for car model.
Of all the variation in city fuel efficiency, about 64% is due to the linear relationship between fuel efficiency and engine size.
compute:
(-.8)^2
## [1] 0.64
residual = y - yhat (obs - exp)
Ex). car model. A particular car has a 3.5 L engine and gets 17 mpg in the city. Compute the residual for this car.
x = 3.5 y = 17
yhat:
26.02 - 2.64*3.5
## [1] 16.78
resid:
17 - 16.78
## [1] 0.22
Our model under-predicted the mpg by 0.22 mpg.
Residual plot:
x-coords: original x-data y-coords: residuals!
Ex). construct resid plot for mpg model.
x-coords = mpg$displ
first: predicted yhat:
yhat <- 26.02 - 2.64*mpg$displ
next: resids
resids <- mpg$cty - yhat
resid plot:
plot(mpg$displ, resids)
In R - “lm” command.
Ex). Find linear model for predicting cty mpg based on displ (engine size).
The “~” reads as “as a function of”.
mpgModel <- lm( cty ~ displ, data = mpg)
Trick: look at the model’s components with $ (e.g. mpgModel$residuals)
Summary:
summary(mpgModel)
##
## Call:
## lm(formula = cty ~ displ, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3109 -1.4695 -0.2566 1.1087 14.0064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.9915 0.4821 53.91 <2e-16 ***
## displ -2.6305 0.1302 -20.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.567 on 232 degrees of freedom
## Multiple R-squared: 0.6376, Adjusted R-squared: 0.6361
## F-statistic: 408.2 on 1 and 232 DF, p-value: < 2.2e-16
Residual plot:
x: original x data y: residuals (from model)
plot(mpg$displ, mpgModel$residuals)
### Interpreting residual plots
Here’s what you DON’T want to see:
1). A pattern (ex: curve-shaped). Idea: A pattern in the residuals means that the errors are PREDICTABLE. Suggests linear model isn’t the best.
Example:
Ex). Data = house sales. Make model to predict price based on sqft_living.
houseData <- read.csv("./House-Sales.csv")
houseModel <- lm( price ~ sqft_living, data = houseData)
residuals:
plot(houseData$sqft_living, houseModel$residuals)
Google image: “heteroskedastic residuals”
In general: you HOPE your residuals are a “blob”. Best case scenario.
Ex) mpg. x=cty, y=hwy.
residuals:
mpgModel2 <- lm(hwy ~ cty, data = mpg)
plot(mpg$cty, mpgModel2$residuals)
Ah! No pattern, roughly homoskedastic. Good residuals!
Idea: we hope that residuals follow a normal dist.
Ie: most close to zero, unlikely to be large.
Ex). mpg. x=cty, y=hwy.
hist(mpgModel2$residuals)
Good resids! Roughly symmetric and unimodal!
Ex). mpg. x=displ, y=cty.
hist(mpgModel$residuals)
Hm. Strong right skew suggests linear model not the best.
Idea: more variables!
In R: use “+” in the lm command.
Ex). house sales. first: predict price based on bedrooms
houseModel1 <- lm(price~bedrooms, data=houseData)
summary(houseModel1)
##
## Call:
## lm(formula = price ~ bedrooms, data = houseData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3508706 -203228 -66728 104982 6839614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129649 8938 14.51 <2e-16 ***
## bedrooms 121790 2556 47.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349500 on 21611 degrees of freedom
## Multiple R-squared: 0.09507, Adjusted R-squared: 0.09503
## F-statistic: 2270 on 1 and 21611 DF, p-value: < 2.2e-16
Interpret slope: For each additional bedroom, we predict the house’s price to increase by $121,790.
y-int: If a house has zero bedrooms (hmm….), we predict its price to be $129,649. Extrapolation.
Note: r^2 = .095. Low! Seems like bedrooms alone don’t account for much of the price.
Ex 2). Predict price based on bedrooms and bathrooms.
houseModel2 <- lm(price ~ bedrooms + bathrooms, data = houseData)
summary(houseModel2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms, data = houseData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1484922 -184520 -42472 111528 5919466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30900 8276 -3.734 0.000189 ***
## bedrooms 20146 2666 7.557 4.27e-14 ***
## bathrooms 237934 3219 73.912 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 312200 on 21610 degrees of freedom
## Multiple R-squared: 0.2777, Adjusted R-squared: 0.2776
## F-statistic: 4154 on 2 and 21610 DF, p-value: < 2.2e-16
Ok, better! Now explaining about 28% of the variation in sale price.
Ex3). Predict price based on bedrooms AND bathrooms AND sqft_living
houseModel3 <- lm(price~bedrooms+bathrooms+sqft_living, data=houseData)
summary(houseModel3)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = houseData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1644794 -144361 -22891 102420 4178611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74662.099 6917.977 10.792 <2e-16 ***
## bedrooms -57906.631 2336.062 -24.788 <2e-16 ***
## bathrooms 7928.708 3512.744 2.257 0.024 *
## sqft_living 309.605 3.089 100.238 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 258000 on 21609 degrees of freedom
## Multiple R-squared: 0.5069, Adjusted R-squared: 0.5069
## F-statistic: 7406 on 3 and 21609 DF, p-value: < 2.2e-16
Best yet! This model explains over 50% of the variation in the y variable.
Idea for multiple variables: hunt for the variables that give highest R^2 value!
Idea: why not use many x variables!
Q: Which x variables to use?
1 - Choose variables that give highest R^2. More predictive power!
2 - Occam’s Razor: the simplest explanation is the best! For multiple regression: hope to use as few variables as possible.
Which variables to boot out of our model? Don’t use variables with high correlation! If two are strongly correlated, then using both doesn’t give more predictive power than just one!
Useful tool in R: ggcorr
Ex). Use ggcorr to summarize correlation coefficients in iris dataset.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Library: GGally
library(GGally)
ggcorr(iris)
Hm. Maybe don’t use BOTH Petal Width AND Petal Length in the same model.
Highly correlated! Not adding to the model.
Resid plot:
x: original x data y: residuals
Hey wait! With multiple x-variables, which one goes on the x-axis?
Idea: for multiple regression residuals:
x: “fitted values” (yhat) y: residuals (y-yhat)
Ex). House sales. Construct linear model to predict price based on bedrooms, bathrooms, sqft_living. Make resid plot.
residExampleModel <- lm( price ~ bedrooms + bathrooms + sqft_living, data = houseData)
Resid plot:
x: fitted values y: resids
plot(residExampleModel$fitted.values, residExampleModel$residuals)
ggcorr(houseData)
Neat R trick: autoplot
library: ggfortify
library(ggfortify)
autoplot(residExampleModel)
Interpret: upper left: strongly heteroskedastic. Bad!
Upper right: deviating from the normal dist. Bad!
Linear regression: needs TWO quantitative variables.
plot(iris$Sepal.Length, iris$Sepal.Width)
Ex). iris. Use species to predict sepal length.
sepal length ~ species
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
lm:
speciesModel <- lm(Sepal.Length ~ Species, data = iris)
summary(speciesModel)
##
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6880 -0.3285 -0.0060 0.3120 1.3120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
## Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
## Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
## F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
Wait! What happened to setosa?
A: one category (the reference level, here setosa) gets “absorbed” into the y-int: the intercept 5.006 is the mean sepal length for setosa.
Summary: when we predict on a category, our prediction is just the MEAN VALUE for that category!
Useful: combine it in a multiple regression model!
Ex). Iris: predict sepal length based on sepal width, petal length, and petal width!
multiCategoryModel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
summary(multiCategoryModel)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82816 -0.21989 0.01875 0.19709 0.84570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.85600 0.25078 7.401 9.85e-12 ***
## Sepal.Width 0.65084 0.06665 9.765 < 2e-16 ***
## Petal.Length 0.70913 0.05672 12.502 < 2e-16 ***
## Petal.Width -0.55648 0.12755 -4.363 2.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared: 0.8586, Adjusted R-squared: 0.8557
## F-statistic: 295.5 on 3 and 146 DF, p-value: < 2.2e-16
Inference: hyp tests and conf ints
Q: is r “big enough”?
A: hyp test!
Hyp Test for rho.
Hypotheses:
H0: rho = 0 (no lin rel) Ha: rho != 0 (yes, some relationship)
Warning: If reject H0, we only conclude there’s SOME linear relationship. Doesn’t necessarily mean STRONG relationship!
Ex) mpg. Predict cty mpg based on displ.
plot(mpg$displ, mpg$cty)
1) Hypotheses
H0: rho = 0 Ha: rho != 0
2). Test stat
Cor coeff:
cor(mpg$displ, mpg$cty)
## [1] -0.798524
Sample size:
length(mpg$displ)
## [1] 234
plug in to t = r*sqrt(n-2)/sqrt(1-r^2):
-.8*sqrt(234-2)/sqrt(1-(-.8)^2)
## [1] -20.30873
3). p-val
recall: cdf for t dist:
pt(-20.3, 232)
## [1] 1.181523e-53
Interpret: if it’s true that there’s no linear relationship btwn engine size (displ) and city mpg, then there’s an almost zero chance of observing a sample correlation as strong as ours.
4). conclusion
Since p-val <<< .05, we reject H0. There’s strong evidence to suggest a linear relationship btwn engine size and fuel efficiency.
Caveat: we are NOT concluding anything about STRENGTH of this relationship, nor that linear is the best model.
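Base R’s `cor.test` runs this exact test in one line; its t statistic is the same r*sqrt(n-2)/sqrt(1-r^2). A sketch on a made-up sample (mpg needs ggplot2 loaded, so toy data keeps this self-contained):

```r
# cor.test implements the t test for H0: rho = 0.
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)                 # built-in linear relationship

r <- cor(x, y)
t_by_hand <- r * sqrt(30 - 2) / sqrt(1 - r^2)

res <- cor.test(x, y)
res$statistic                          # identical to t_by_hand
```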
lm(mpg$cty~mpg$displ)
##
## Call:
## lm(formula = mpg$cty ~ mpg$displ)
##
## Coefficients:
## (Intercept) mpg$displ
## 25.99 -2.63
In R, to find the std err for the slope, just look at summary(lm):
summary(lm(mpg$cty~mpg$displ))
##
## Call:
## lm(formula = mpg$cty ~ mpg$displ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3109 -1.4695 -0.2566 1.1087 14.0064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.9915 0.4821 53.91 <2e-16 ***
## mpg$displ -2.6305 0.1302 -20.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.567 on 232 degrees of freedom
## Multiple R-squared: 0.6376, Adjusted R-squared: 0.6361
## F-statistic: 408.2 on 1 and 232 DF, p-value: < 2.2e-16
Ex). Estimate the population slope for the model (x=displ, y=cty) with 95% confidence.
b1 = -2.631 sb = 0.130 n = 234
critical value: t = 1.970
qt(.025, 232)
## [1] -1.970242
plug and chug!
lo:
-2.631 - 1.970*0.130
## [1] -2.8871
hi:
-2.631 + 1.970*0.130
## [1] -2.3749
We’re 95% confident that true population slope btwn displ and cty mpg is btwn -2.887 mpg and -2.375 mpg.
Ie, for each additional liter bigger a car’s engine is, we’d expect the cty mpg to decrease by between 2.375 and 2.887 mpg.
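R builds this interval directly with `confint()` on a fitted model; here’s a sketch on toy data showing it equals b1 +/- t* x SE(b1):

```r
# confint() reproduces the by-hand slope CI.
set.seed(2)
x <- rnorm(40)
y <- 3 - 2 * x + rnorm(40)            # toy data with known slope -2

fit <- lm(y ~ x)
confint(fit, "x", level = 0.95)       # slope CI in one line

# By hand: b1 +/- t* x SE, with df = n - 2 = 38.
b1 <- coef(fit)["x"]
se <- coef(summary(fit))["x", "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = 38) * se
```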
Idea: do two (or more) different distributions match up?
Ex). A cognitive psychologist wonders: do babies prefer a particular color of toy?
To investigate, 100 babies are presented with 4 identical toys: one blue, one yellow, one green, one red.
Results: 35 babies pick red, 15 pick yellow, 20 pick blue, and 30 pick green.
Q: is there evidence for a preference of one color? Q: is there evidence that not all colors equally likely?
If all colors are equally preferred, how many of the 100 babies would we expect to pick each color? 25!
Test:
1). Hypotheses
H0: all colors equally likely to be chosen (each p = .25) Ha: at least one proportion differs
2). Test stat
chisq = (35-25)^2/25 + (15 - 25)^2/25 + (20-25)^2/25 + (30-25)^2/25
(35-25)^2/25 + (15 - 25)^2/25 + (20-25)^2/25 + (30-25)^2/25
## [1] 10
chisq = 10.000
df = #categories - 1 = 4 - 1 = 3
3). p-val
1-pchisq(10.000, 3)
## [1] 0.01856614
p-val = .019
If it’s true that all colors are equally likely to be chosen, then there’s a 1.9% chance of observing a sample result like ours.
4). Conclusion
Since p-val < .05, we reject H0. Found strong evidence that not all colors equally likely to be chosen.
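`chisq.test` does all of this in one line; given a bare vector of counts, its default null is equal probabilities:

```r
# Goodness-of-fit: default H0 is equal probabilities (.25 each).
counts <- c(red = 35, yellow = 15, blue = 20, green = 30)
chisq.test(counts)
# X-squared = 10, df = 3, p-value = 0.0186 -- matching the hand work
```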
Q: is there statistically sig evidence of a difference in proportion of cars that have 6 cyl btwn compact and midsize cars?
In R: table command.
addmargins( table(mpg$cyl, mpg$class) )
##
## 2seater compact midsize minivan pickup subcompact suv Sum
## 4 0 32 16 1 3 21 8 81
## 5 0 2 0 0 0 2 0 4
## 6 0 13 23 10 10 7 16 79
## 8 5 0 2 0 20 5 38 70
## Sum 5 47 41 11 33 35 62 234
H0: p1 - p2 = 0 Ha: p1 - p2 != 0
pooled proportion: (13 + 23)/(47 + 41) = 36/88
(13/47-23/41)/sqrt(36/88*(1-36/88)/47 + 36/88*(1-36/88)/41)
## [1] -2.706625
2*pnorm(-2.707)
## [1] 0.006789426
4). Conclusion
Since p-val < .05, we reject H0. We’ve found strong evidence of a difference in proportion of 6-cyl cars btwn compact and midsize cars.
“Stacked” bar graph (beside = T gives side-by-side bars instead)
Ex). bar graph of cyl by class.
myTable <- table(mpg$cyl, mpg$class)
barplot(myTable, beside=T)
ANOVA = “analysis of variance”
Idea: Is there evidence of a difference in means (MORE THAN 2 GROUPS!)
Ex). Iris. Is there evidence of a difference in mean sepal length between the 3 species: setosa, versicolor, virginica?
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Visualization: boxplot
boxplot(iris$Sepal.Length ~ iris$Species, horizontal = T)
H0: means are equal (no difference) Ha: at least one differs
Idea: F = (variation between groups) / (variation WITHIN groups)
compute in r:
irisAnovaModel <- aov(iris$Sepal.Length ~ iris$Species)
summary(irisAnovaModel)
## Df Sum Sq Mean Sq F value Pr(>F)
## iris$Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test stat = F = 119.3
3). p-val
p-val ~ 0
4). Conclusion
Since p-val < .05, we reject H0. We’ve found strong evidence of a difference in mean sepal length between the three species.
boxplot(iris$Sepal.Length~iris$Species, horizontal = T)
Idea: is there evidence of a relationship between two CATEGORICAL variables?
Ex). mpg
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
H0: cyl and class are independent Ha: there’s some relationship btwn cyl and class
Compute expected counts for all cyl/class combinations.
Idea: If A,B independent, then P(A and B) = P(A)*P(B)
In r: chisq.test()
chisq.test(mpg$cyl, mpg$class)
##
## Pearson's Chi-squared test
##
## data: mpg$cyl and mpg$class
## X-squared = 138.03, df = 18, p-value < 2.2e-16
ourModel <- chisq.test(mpg$cyl, mpg$class)
ourModel$observed
## mpg$class
## mpg$cyl 2seater compact midsize minivan pickup subcompact suv
## 4 0 32 16 1 3 21 8
## 5 0 2 0 0 0 2 0
## 6 0 13 23 10 10 7 16
## 8 5 0 2 0 20 5 38
ourModel$expected
## mpg$class
## mpg$cyl 2seater compact midsize minivan pickup subcompact
## 4 1.73076923 16.2692308 14.1923077 3.8076923 11.4230769 12.1153846
## 5 0.08547009 0.8034188 0.7008547 0.1880342 0.5641026 0.5982906
## 6 1.68803419 15.8675214 13.8418803 3.7136752 11.1410256 11.8162393
## 8 1.49572650 14.0598291 12.2649573 3.2905983 9.8717949 10.4700855
## mpg$class
## mpg$cyl suv
## 4 21.461538
## 5 1.059829
## 6 20.931624
## 8 18.547009
chisq = 138.03
p-val ~ 0
Since p-val < .05, we reject H0. There’s strong evidence of a relationship btwn cyl and class.
Visualization: stacked bar plot
mpgTable <- table(mpg$cyl, mpg$class)
mpgTable
##
## 2seater compact midsize minivan pickup subcompact suv
## 4 0 32 16 1 3 21 8
## 5 0 2 0 0 0 2 0
## 6 0 13 23 10 10 7 16
## 8 5 0 2 0 20 5 38
barplot(mpgTable, beside=T, legend=rownames(mpgTable))